1 Data Introduction

All the datasets used in this report are from the Bay of Fundy. There are 35 SectorIDs and 637 SiteIDs in the Bay of Fundy.

1.1 Loading the Dataset for Bay of Fundy

SectorID  SiteID  Latitude  Longitude  log_FC  Date        normalize_tide
1410      13501   44.1862   -66.173    0.6419  2004-06-16   0.8149038
1410      13501   44.1862   -66.173    1.9459  2004-07-14   0.8350515
1410      13501   44.1862   -66.173    0.6419  2004-08-12   0.7380934
1410      13501   44.1862   -66.173    0.6931  2004-08-26   0.6124402
1410      13501   44.1862   -66.173    1.6094  2004-09-01   0.5346154
1410      13501   44.1862   -66.173    1.6094  2007-06-05  -0.0413462

SiteID  Tide_Level  Tide_Speed  Week_rescaled  Year_rescaled  Latitude_rescaled
13501   HR           0.1682692  0.50           0.2173913      0
13501   HF          -0.0979381  0.58           0.2173913      0
13501   HR           0.2197802  0.68           0.2173913      0
13501   HF          -0.2583732  0.72           0.2173913      0
13501   HR           0.4230769  0.72           0.2173913      0
13501   MR           0.4615617  0.48           0.3478261      0

  • Excluded data containing ‘E’ or NA.

  • Excluded data before the year 1999.

  • Applied min-max normalization to rescale the Tide value to the range [-1, 1].

  • Applied a natural-logarithm transformation to the fecal coliform counts (log_FC); for example, the values 0.6931 and 1.6094 in the sample rows equal ln 2 and ln 5.

  • Tide Speed is the difference between the two tidal hours in the original data.

    • Positive values mean rising.
    • Negative values mean falling.
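
The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the code used for the report: the raw tide heights and raw counts are hypothetical, the helper names are invented for the example, and the normalization is assumed to be applied over the whole series at once. The natural logarithm is used because it reproduces the log_FC values in the sample rows.

```python
import math


def min_max_to_signed_unit(values):
    """Rescale values linearly so the minimum maps to -1 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant series")
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def tide_direction(tide_speed):
    """Rising when Tide_Speed >= 0, falling otherwise (the split used in this report)."""
    return "rising" if tide_speed >= 0 else "falling"


# Hypothetical raw tide heights (the report does not list the raw values).
raw_tide = [0.4, 2.1, 3.7, 5.0]
normalized = min_max_to_signed_unit(raw_tide)

# Assumed raw counts; their natural logs match the log_FC column in the sample rows.
counts = [1.9, 2.0, 5.0, 7.0]
log_fc = [round(math.log(c), 4) for c in counts]

print(normalized[0], normalized[-1])  # endpoints of the rescaled range
print(log_fc)
print(tide_direction(0.1682692), tide_direction(-0.0979381))
```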



1.2 The Heatmap for Site vs. Year

This heatmap shows the distribution of observations for different SiteIDs across years from 1999 to 2022 in the Bay of Fundy.

  • Red cells indicate SiteIDs and years where there are 2 or more observations.

  • Grey cells indicate SiteIDs and years where there are fewer than 2 observations.

  • By looking at the vertical distribution of red and grey cells, you can observe temporal trends.

  • The horizontal distribution of colors across SiteIDs can reveal spatial patterns.

  • Areas with continuous grey cells or large gaps in the heatmap can indicate periods or SiteIDs with missing or no data.
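
The red/grey coding described above amounts to counting observations per SiteID and year and comparing each count with 2. A minimal sketch, assuming the data are available as (SiteID, year) pairs; the records and the function name are hypothetical:

```python
from collections import Counter


def coverage_matrix(records, years, threshold=2):
    """Map each SiteID to a list of booleans, one per year.

    True corresponds to a red cell (at least `threshold` observations),
    False to a grey cell (fewer than `threshold` observations).
    """
    counts = Counter(records)
    sites = sorted({site for site, _ in records})
    return {site: [counts[(site, y)] >= threshold for y in years] for site in sites}


# Hypothetical (SiteID, year) pairs, one per observation.
records = [(13501, 2004), (13501, 2004), (13501, 2007), (13502, 2004)]
years = [2004, 2005, 2006, 2007]

matrix = coverage_matrix(records, years)
for site, row in matrix.items():
    print(site, "".join("R" if cell else "." for cell in row))
```

A site with one observation in a year is grey under this rule, so grey cells do not always mean that no sample was taken.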

Summary:

  • Observation Densities: Recent years (especially 2011, 2013-2022) show high observation densities across many SiteIDs.

  • Consistent Monitoring: Some SiteIDs have been consistently monitored over the years, as indicated by continuous red cells.

  • Gaps in Data: Early years (1999-2003) show more gaps, indicating fewer observations or missing data.

1.3 Tide Level

This section presents boxplots for Tide level. The original data contain Start_WT and End_WT. When we first created boxplots of tide value by tide state, there were many outliers. We therefore re-leveled the Tide level based on the tide value, following the Tide calculation instructions that we received via email.

1.3.1 The Boxplot for Tide Level

1.3.2 Tukey’s HSD for Tide Level

The plot shown is a result of Tukey’s Honest Significant Difference (HSD) post hoc test, which was performed after conducting an ANOVA (Analysis of Variance) to compare the mean log-transformed fecal coliform counts (log(FC)) across different Tide_Level groups.

  • Each line represents the difference in mean log(FC) between two Tide_Level groups.

  • The labels on the left indicate the pairs being compared (e.g., HR-HF, LF-HF).

  • The blue bars are the 95% confidence intervals for these differences.

  • If the confidence interval for a pair does not cross the dashed vertical line at zero, the difference between the means of those two groups is statistically significant. If it crosses zero, the difference is not significant.

  • Left of Zero (Negative Mean Difference): If the entire confidence interval is to the left of the dashed vertical line at zero, it indicates that the first group in the pair has a significantly lower mean log(FC) compared to the second group.

  • Right of Zero (Positive Mean Difference): If the entire confidence interval is to the right of the dashed vertical line at zero, it indicates that the first group in the pair has a significantly higher mean log(FC) compared to the second group.
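
The reading rules above can be written as a small decision function. This is only a sketch of the interpretation logic; the interval endpoints in the example are hypothetical and are not taken from the plots in this report.

```python
def interpret_interval(pair, lower, upper):
    """Interpret a Tukey HSD confidence interval for 'first-second'.

    The interval is for mean(first) - mean(second) on the log(FC) scale.
    """
    first, second = pair.split("-")
    if lower > 0:
        return f"{first} has a significantly higher mean log(FC) than {second}"
    if upper < 0:
        return f"{first} has a significantly lower mean log(FC) than {second}"
    return f"no significant difference between {first} and {second}"


# Hypothetical intervals, for illustration only.
print(interpret_interval("LF-HF", 0.05, 0.30))
print(interpret_interval("MF-LF", -0.25, -0.02))
print(interpret_interval("MR-HR", -0.10, 0.12))
```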

1.3.2.1 Tide Speed >= 0 (Rising Tide) and Tide Speed < 0 (Falling Tide)

  • Blue color for the Tide Speed >= 0 (Rising Tide).

  • Red color for the Tide Speed < 0 (Falling Tide).

Summary for Tide Speed >= 0 (Rising Tide):

  • HR-HF: The confidence interval is to the right of zero, indicating that HR has a significantly higher mean log(FC) than HF.

  • LF-HF: The confidence interval is to the right of zero, indicating that LF has a significantly higher mean log(FC) than HF.

  • MF-HF: The confidence interval is to the right of zero, indicating that MF has a significantly higher mean log(FC) than HF.

  • MR-HF: The confidence interval is to the right of zero, indicating that MR has a significantly higher mean log(FC) than HF.

  • LF-HR: The confidence interval is to the right of zero, indicating that LF has a significantly higher mean log(FC) than HR.

  • MF-LF: The confidence interval is to the left of zero, indicating that MF has a significantly lower mean log(FC) than LF.

Summary for Tide Speed < 0 (Falling Tide):

  • LF-HF: The confidence interval is to the right of zero, indicating that LF has a significantly higher mean log(FC) than HF.

  • MF-HF: The confidence interval is to the right of zero, indicating that MF has a significantly higher mean log(FC) than HF.

  • MF-LF: The confidence interval is to the left of zero, indicating that MF has a significantly lower mean log(FC) than LF.

1.4 Sample Size for Each SectorID

This plot is a collection of scatter plots, each representing a different SectorID.

  • Each subplot represents a different SectorID, showing the distribution of log-transformed fecal coliform counts (log(FC)) over the weeks of the year.

  • The x-axis represents the weeks of the year (from 1 to 52), and the y-axis represents the log-transformed fecal coliform counts (log(FC)).

  • The density of points in each subplot indicates the number of observations (sample size) for each SectorID across different weeks.

  • The distribution of points along the x-axis for each SectorID can show temporal patterns in fecal coliform counts.

  • By comparing subplots, we can see differences and similarities in the distribution of log(FC) values among different SectorIDs.

  • Some SectorIDs might have more data points, indicating more frequent or consistent sampling, while others have fewer points.

1.4.1 Full Dataset

Summary:

  • Some sectors, such as SectorID 680, 697, and 706-710, show a large number of observations spread throughout the weeks, indicating that these sectors have been consistently sampled.

  • Some sectors like SectorID 667, 675, 686, 689 and 375409 have fewer data points, indicating less frequent sampling.

  • Seasonal Trends: In some sectors, like SectorID 1411-1451, there are visible clusters of observations in certain weeks, indicating that samples for these sectors were only taken during limited time periods.

1.4.2 Tide Speed >= 0 (Rising Tide)

This plot is similar to the previous one, but it specifically shows the sample size for each SectorID when the tide speed is greater than or equal to 0, which corresponds to the Rising Tide condition.

1.4.3 Tide Speed < 0 (Falling Tide)

This plot is similar to the previous one, but it specifically shows the sample size for each SectorID when the tide speed is less than 0, which corresponds to the Falling Tide condition.



2 Fitted Model

We fitted two types of models: a linear mixed-effects model and a generalized additive model (GAM).

2.1 Linear Mixed Effect Model

For the linear model, we used a linear mixed-effects model, which plays an important role in longitudinal data analysis. Longitudinal data involve repeated observations of the same subjects over a period of time. In our study, the longitudinal data are fecal coliform counts from different sites within different sectors over multiple years. The model is:

\[ \log(\text{FC}_{ijk}) = \beta_0 + \beta_1 \cdot \text{normalize_tide}_{ijk} + u_{ijk} \]

\[ u_{ijk} = b_i + c_{ij} + d_{ijk} + \epsilon_{ijk} \]

Where:

  • \(\log(\text{FC}_{ijk})\) is the outcome, the natural logarithm of the fecal coliform count for the \(k\)-th observation within the \(j\)-th site within the \(i\)-th sector.

  • \(\beta_0\) is the overall intercept.

  • \(\beta_1\) is the coefficient for the fixed effect of normalize_tide.

  • \(\text{normalize_tide}_{ijk}\) is the fixed-effect predictor, the normalized tide value for the \(k\)-th observation within the \(j\)-th site within the \(i\)-th sector.

  • \(u_{ijk}\): The combined random effect term (including sector, site within sector, date, and residual error).

  • \(b_i\): The random intercept for the \(i\)-th sector, \(b_i \sim N(0, \sigma^2_b)\).

  • \(c_{ij}\): The random intercept for the \(j\)-th site within the \(i\)-th sector, \(c_{ij} \sim N(0, \sigma^2_c)\).

  • \(d_{ijk}\): The random effect associated with the Date for the \(k\)-th observation within the \(j\)-th site within the \(i\)-th sector, \(d_{ijk} \sim N(0, \sigma^2_d)\).

  • \(\epsilon_{ijk}\): The residual error term, \(\epsilon_{ijk} \sim N(0, \sigma^2)\).

This linear mixed-effects model helps us to account for the nested structure of our data and the correlation between repeated measurements within the same site and sector.

  • Fixed effects represent the overall impact of predictors that are assumed to be the same across all observations. In this model, normalize_tide is the fixed effect.

  • Random effects account for variations at different levels of the data structure that are not explained by the fixed effects. In this model, \(b_i\), \(c_{ij}\), and \(d_{ijk}\) are random effects, representing the variability among sectors, sites, and dates, respectively.
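
To make the nested structure concrete, the sketch below simulates data from the model as written above. It illustrates the data-generating process only and is not the fitting code used for this report. All parameter values (intercept, slope, standard deviations, group sizes) are arbitrary assumptions, and the function name is invented for the example.

```python
import random


def simulate_log_fc(n_sectors=3, sites_per_sector=2, obs_per_site=4,
                    beta0=1.0, beta1=-0.07,
                    sd_sector=0.3, sd_site=0.2, sd_date=0.1, sd_resid=0.5,
                    seed=0):
    """Simulate log(FC) = beta0 + beta1 * normalize_tide + b_i + c_ij + d_ijk + eps_ijk."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_sectors):
        b_i = rng.gauss(0.0, sd_sector)            # shared by every site in sector i
        for j in range(sites_per_sector):
            c_ij = rng.gauss(0.0, sd_site)         # shared by every observation at site j
            for k in range(obs_per_site):
                d_ijk = rng.gauss(0.0, sd_date)    # date-level random effect
                eps_ijk = rng.gauss(0.0, sd_resid) # residual error
                tide = rng.uniform(-1.0, 1.0)      # normalize_tide lies in [-1, 1]
                u_ijk = b_i + c_ij + d_ijk + eps_ijk
                rows.append({
                    "sector": i, "site": j, "obs": k,
                    "normalize_tide": tide,
                    "b": b_i, "c": c_ij, "u": u_ijk,
                    "log_fc": beta0 + beta1 * tide + u_ijk,
                })
    return rows


rows = simulate_log_fc()
print(len(rows), "simulated observations")
print(rows[0])
```

Because every observation in a sector shares the same \(b_i\), and every observation at a site shares the same \(c_{ij}\), measurements from the same site or sector are correlated, which is the dependence the random intercepts are meant to capture.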

2.1.1 Linear Mixed Effect Model Result for Full Dataset

Dataset           Estimate β1   P-Value
Full data         -0.068        2.6e-04
Tide Speed >= 0    0.018        5.2e-01
Tide Speed < 0    -0.131        4.1e-07

Summary:

  • The normalized tide value has a significant negative effect on fecal coliform counts for the Full Dataset.

  • When considering only Tide Speed >= 0, the effect of normalized tide value on fecal coliform counts is positive but not statistically significant.

  • For Tide Speed < 0, the normalized tide value has a significant negative effect on fecal coliform counts.

2.1.2 Linear Mixed Effect Model Result for Summer Dataset (20 <= Week <= 40)

Dataset                   Estimate β1   P-Value
Summer Full data          -0.119        3.6e-07
Summer Tide Speed >= 0    -0.038        2.7e-01
Summer Tide Speed < 0     -0.186        7.1e-09

Summary:

  • For the Summer Dataset, the estimated coefficient is larger in magnitude (-0.119 vs. -0.068) and the p-value is smaller (3.6e-07 vs. 2.6e-04) than for the full dataset, indicating a stronger effect in the Summer Dataset.

  • The normalized tide value has a significant negative effect on fecal coliform counts for the Summer Dataset.

  • When considering only Tide Speed >= 0, the effect of normalized tide value on fecal coliform counts is negative but not statistically significant.

  • For Tide Speed < 0, the normalized tide value has a significant negative effect on fecal coliform counts.

2.2 Generalized Additive Model (GAM)

GAMs are flexible models that allow for non-linear relationships between the predictors and the response variable. They use smooth functions to model these relationships, which can give a better fit than traditional linear models when the relationships are non-linear.

The GAM is:

\[ \log(\text{FC}_i) = f_1(\text{Week_rescaled}_i) + f_2(\text{Week_rescaled}_i)\cdot \text{normalize_tide}_i + \epsilon_i \]

Where:

  • \(\log(\text{FC}_i)\) is the natural logarithm of the fecal coliform count for the \(i\)-th observation.

  • \(f_1(\text{Week_rescaled}_i)\) is a smooth function of Week_rescaled.

  • \(f_2(\text{Week_rescaled}_i) \cdot \text{normalize_tide}_i\) is a smooth function of Week_rescaled multiplied by normalize_tide, so the effect of normalize_tide is allowed to vary smoothly over the year.

  • \(\epsilon_i \sim N(0, \sigma^2)\) is the residual error term.

  • The smooth function \(f_1(\text{Week_rescaled})\) is constructed using cyclic cubic splines, which are particularly useful for periodic data, such as weekly or seasonal data, because they ensure that the fitted curve joins up continuously between the end of one year and the start of the next.
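
The structure of this model can be illustrated with a small sketch. Note the substitution: instead of cyclic cubic splines, the example uses a few sine and cosine harmonics as a simple periodic stand-in, because they also take the same value at Week_rescaled = 0 and Week_rescaled = 1. All coefficients are hypothetical and are not estimates from this report.

```python
import math


def cyclic_smooth(week_rescaled, coefs):
    """Periodic stand-in for a cyclic smooth: a constant plus Fourier harmonics."""
    constant, harmonics = coefs
    value = constant
    for h, (a, b) in enumerate(harmonics, start=1):
        angle = 2.0 * math.pi * h * week_rescaled
        value += a * math.cos(angle) + b * math.sin(angle)
    return value


# Hypothetical coefficients for the two smooths.
F1 = (1.0, [(-0.4, 0.1)])     # seasonal baseline f1(Week_rescaled)
F2 = (-0.1, [(0.05, -0.05)])  # week-specific tide coefficient f2(Week_rescaled)


def predicted_log_fc(week_rescaled, normalize_tide):
    """Mean of the model: f1(week) + f2(week) * normalize_tide."""
    return cyclic_smooth(week_rescaled, F1) + cyclic_smooth(week_rescaled, F2) * normalize_tide


# The curves join up across the year boundary.
print(cyclic_smooth(0.0, F1), cyclic_smooth(1.0, F1))

# f2(week) is the change in predicted log(FC) per unit of normalize_tide at that week.
week = 0.7
print(predicted_log_fc(week, 1.0) - predicted_log_fc(week, 0.0), cyclic_smooth(week, F2))
```

Read this way, a value of \(f_2\) below zero at a given week means that higher normalized tide values are associated with lower log(FC) in that week.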

2.2.1 GAM for Week vs. Year

Objective: Illustrate how log(FC) varies with Week_rescaled (on the x-axis) and Year_rescaled (on the y-axis).

  • The 3D surface highlights the interaction effect between the week and year on fecal coliform counts.

  • Weekly Patterns: The plot shows variations in fecal coliform levels over the weeks of the year. Peaks and troughs along the Week_rescaled axis indicate seasonal trends or periodic fluctuations in fecal coliform counts.

  • Yearly Patterns: The plot also reveals changes in fecal coliform levels over the years. Trends along the Year_rescaled axis can indicate whether fecal coliform counts have generally increased, decreased, or remained stable over time.

Summary:

Seasonal Trends

  • There is a noticeable increase in log(FC) values during weeks 26 to 41, indicating typically higher fecal coliform counts in summer and early fall. The peak occurs around weeks 35 to 36.

  • During late winter and early spring (weeks 6 to 16), there seem to be relatively lower fecal coliform counts, as indicated by the blue areas.

Long-term Trends Over 20 Years

  • The log(FC) values tend to increase and decrease in waves over the years. For example, there are peaks around 1999, 2011, and 2021, and lower values around 2003 and 2016.

2.2.2 GAM for Week vs. Latitude

Objective: Illustrate how log(FC) varies with Week_rescaled (on the x-axis) and Latitude_rescaled (on the y-axis).

  • The 3D surface highlights the interaction effect between the week and latitude on fecal coliform counts.

  • Latitude Patterns: The plot reveals how fecal coliform levels vary across different latitudes. Trends along the Latitude_rescaled axis can indicate whether certain latitudes consistently have higher or lower fecal coliform counts.

Summary:

  • For most latitudes, log(FC) values begin to increase noticeably around week 21, peak around weeks 35 to 36, and then decrease.

  • The highest log(FC) values (yellow and red regions) are observed at latitudes between 44.19 and 44.48 and near 45.22, suggesting that these latitudes might experience higher fecal coliform levels.

2.2.3 GAM for log(FC) and normalize_tide

The plots generated by the GAM provide insight into the relationship between the predictor variable normalize_tide and the response variable log(FC).

  • Each plot represents a smooth term in the GAM model. The solid line represents the estimated smooth effect of the predictor variable, while the dashed lines represent the 95% confidence intervals.

  • The y-axis shows the effect size, and the x-axis shows the predictor variable (Week_rescaled).

  • The red horizontal line at y = 0 serves as a reference for no effect.

2.2.3.1 Results for Full Dataset

Summary:

  • For most weeks, normalize_tide shows a negative relationship with log(FC), as evidenced by the smooth effect being below the red reference line for all three datasets.

  • These trends highlight that the effect of normalize_tide on fecal coliform counts varies throughout the year, showing both negative and positive relationships depending on the week and tide speed conditions.

2.2.3.2 Results for Summer Data

Summary:

  • All three plots show that normalize_tide has a negative relationship with log(FC) in the summer data, as evidenced by the smooth effect being below the red reference line in all three summer datasets.